It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These type of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized both because the raw data are hard to obtain and there is a lack of statistical methods and software for processing and interpreting the data.
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consists of two months of data from an anonymous individual collected during the months of October and November 2012 and include the number of steps taken in 5 minute intervals each day.
Data Source: The dataset was provided as part of the Reproducible Research course by Johns Hopkins University on Coursera. It can be downloaded from the following URL: activity.zip.
Purpose: This analysis of the dataset aims to explore daily activity patterns, address missing data, and compare activity across weekdays and weekends.
Before we begin, I’m going to use the following code to set up the
global environment for my markdown file. This code chunk will load all
required libraries and will configure knitr options. These
steps ensure that my document will be both clean and reproducible.
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(hrbrthemes)
This code imports and preprocesses the activity dataset. It reads the
CSV file into activity_data, converts the date
column to the Date class, and then uses str()
and summary() to confirm the structure and basic statistics
of the data. This step ensures the dataset is properly formatted and
ready for further analysis:
activity_data <- read.csv("activity.csv", stringsAsFactors = FALSE)
activity_data$date <- as.Date(activity_data$date, format = "%Y-%m-%d")
str(activity_data)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
summary(activity_data)
## steps date interval
## Min. : 0.00 Min. :2012-10-01 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
## Median : 0.00 Median :2012-10-31 Median :1177.5
## Mean : 37.38 Mean :2012-10-31 Mean :1177.5
## 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
## Max. :806.00 Max. :2012-11-30 Max. :2355.0
## NA's :2304
The str() and summary() functions provide a
structural overview and summary statistics of the dataset, helping us
make initial observations. First, we notice a significant number of
missing values (NAs), which we will address in a later step
to ensure the integrity of our analysis. Second, the dataset charts
daily steps across time intervals, offering insight into when people are
most active during the day and week. These observations lay the
foundation for exploring activity patterns and identifying peak
intervals in the next steps.
This code aggregates total steps by date using
aggregate() and removes missing values. The resulting
daily_steps dataset contains one row per day, with a sum of
all steps taken that day. Converting date to a factor supports
categorical analysis and plotting by day.
daily_steps <- aggregate(steps ~ date, data = activity_data, FUN = sum, na.rm = TRUE)
daily_steps <- daily_steps %>% mutate(date = factor(date))
In this next section, we create a histogram to visualize the total number of steps taken each day. The data is grouped into bins of 2,500 steps, and each bar is colored based on the specific day. I want the chart to be clear and easy to read, and so I set the y-axis to show a maximum of 20 days. This visualization provides a quick overview of the distribution of daily total steps:
ggplot(daily_steps, aes(x = steps, fill = date)) +
geom_histogram(breaks = seq(0, 25000, by = 2500), show.legend = FALSE) +
ylim(0, 20) +
labs(title = "Distribution of Daily Step Counts Over Two Months.",
x = "Total Steps Taken", y = "Number of Days")
In this part, we calculate the average (mean) and the
middle value (median) of the total steps taken each day
using the daily_steps dataset. These numbers help us
understand the typical daily activity levels by summarizing the central
trend of the data:
mean_steps <- mean(daily_steps$steps)
median_steps <- median(daily_steps$steps)
mean_steps
## [1] 10766.19
median_steps
## [1] 10765
This finding indicates that the individual’s activity peaks in the morning, likely reflecting a consistent routine.
This code calculates the average steps per interval across all days and identifies the interval with the highest mean steps. These results help pinpoint when daily activity tends to peak.
interval_steps <- aggregate(steps ~ interval, data = activity_data, FUN = mean, na.rm = TRUE)
max_interval <- interval_steps[which.max(interval_steps$steps), "interval"]
These steps convert numeric intervals into a time-of-day format
(HH:MM) by zero-padding the interval values, extracting hours and
minutes, and constructing a POSIXct datetime. This enables
the plotting of intervals as actual times. The maximum interval is
similarly processed to produce a human-readable time_label
for annotations in later visualizations.
interval_padded <- sprintf("%04d", interval_steps$interval)
hours <- as.integer(substr(interval_padded, 1, 2))
minutes <- as.integer(substr(interval_padded, 3, 4))
interval_steps$time_of_day <- as.POSIXct(
sprintf("1970-01-01 %02d:%02d:00", hours, minutes),
format = "%Y-%m-%d %H:%M:%S", tz = "GMT"
)
max_interval_str <- sprintf("%04d", max_interval)
max_hour <- as.integer(substr(max_interval_str, 1, 2))
max_minute <- as.integer(substr(max_interval_str, 3, 4))
max_time <- as.POSIXct(sprintf("1970-01-01 %02d:%02d:00", max_hour, max_minute),
format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
time_label <- paste0(max_hour, ":", sprintf("%02d", max_minute))
Prior to plotting this data, I want to calculate the overall mean steps across all intervals, providing a baseline reference that can be uses in the next step for annotating the plot:
mean_steps_value <- mean(interval_steps$steps)
Next, in order to visualize the data and to try and make sense of any average daily step patterns, I created a time-series graph that shows the average number of steps taken at different times of the day. The graph highlights the specific time interval with the highest average steps using both text and a red point marker. I chose to add a horizontal blue line to represent the overall average number of steps across all intervals. Additionally, I customized the appearance of the plot with a clean theme, used vibrant colors to enhance clarity and visual appeal, and set the x-axis to display time in 4-hour increments.
p <- interval_steps %>%
ggplot(aes(x = time_of_day, y = steps)) +
geom_line(color = "#dd00f9", size = 0.5) +
annotate("text", x = max_time, y = max(interval_steps$steps) + 5,
label = paste("Max interval:", time_label),
color = "black") +
annotate("point", x = max_time, y = max(interval_steps$steps),
size = 3, shape = 21, fill = "transparent", color = "red") +
geom_hline(yintercept = mean_steps_value, color = "#0099f9", size = 0.75) +
labs(title = "Average Number of Steps per Time of Day",
x = "Time of Day",
y = "Average Number of Steps") +
theme_ipsum(base_size = 14) +
theme(
plot.title = element_text(color = "#0099f9", size = 16, face = "bold"),
axis.text = element_text(color = "black"),
axis.title = element_text(color = "black"),
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "white", color = NA)
) +
scale_x_datetime(date_breaks = "4 hours", date_labels = "%H:%M", expand = c(0,0))
Finally, in order to make the plot more dynamic and to allow for
users to more deeply explore the data, I transformed the static plot
p created with ggplot2 into an interactive
visualization using Plotly’s ggplotly() function. This
modification allows users to engage with the plot through features like
tooltips, zooming, and panning.
fig <- ggplotly(p)
fig
In the original dataset, there are a number of days/intervals where
there are missing values (coded as NA). My goal here is to
calculate and report the total number of missing values in the dataset
(i.e. the total number of rows with NAs), and then to
devise a strategy for filling in all of the missing values. I begin by
using the following code to first check each entry in the steps column
for NA, and then to create a lookup vector
interval_mean_map that associates each interval with its
average number of steps. This new vector that I’ve created simplifies
the process of filling in missing steps values (NA) by
providing a straightforward way to replace them with the average steps
for each specific interval.
total_na <- sum(is.na(activity_data$steps))
total_na
## [1] 2304
interval_mean_map <- setNames(interval_steps$steps, interval_steps$interval)
The next step is to create a copy of activity_data named
imputed_data, which identifies which steps
entries are NA, and assigns each missing value the average
steps from interval_mean_map based on the interval. This
imputation ensures that the dataset is complete, allowing for accurate
subsequent analyses without the bias that could result from missing
data.
imputed_data <- activity_data
missing_indices <- is.na(imputed_data$steps)
imputed_data$steps[missing_indices] <- interval_mean_map[as.character(imputed_data$interval[missing_indices])]
My next step was to calculate the total number of steps taken each day by summing up all the steps across every interval for each date, and then to create a histogram in order to visualize how these daily steps are distributed. After experimenting with different plot dimensions, I decided to use a bin width of 1,000 steps, and limited the y-axis to 20, so that the plot would more clearly and effectively show the number of days that fall into each step range.
daily_steps_imputed <- aggregate(steps ~ date, data = imputed_data, FUN = sum)
daily_steps_imputed <- daily_steps_imputed |>
mutate(date = factor(date))
ggplot(daily_steps_imputed, aes(x = steps, fill = date)) +
geom_histogram(breaks = seq(0,25000, by = 1000), show.legend = FALSE) +
ylim(0, 20) +
labs(title = "Total Steps per Day (Imputed Data)", x = "Total Steps", y = "Count of Days")
Next, let’s calculate the mean and median of the total daily steps using the imputed dataset, to see if both values undergo a significant change since we first observed the values in Step 2:
mean_steps_imputed <- mean(daily_steps_imputed$steps)
median_steps_imputed <- median(daily_steps_imputed$steps)
mean_steps_imputed
## [1] 10766.19
median_steps_imputed
## [1] 10766.19
As you can see, this method for handling missing step counts involved replacing each missing value with the average number of steps for that specific time period. This approach ensured that the overall daily average and median step counts remained the same as in our original analysis. By using representative values for each time interval, we maintained the consistency and reliability of our results. This demonstrates the importance of careful data handling in preserving the accuracy of our findings.
Here, we categorize each day as either a “weekday” or a “weekend”. I
used the weekdays() function to determine the day of the
week for each date. If the day is Saturday or Sunday, it’s labeled as
“weekend”; otherwise, it’s labeled as “weekday”. Next, I converted the
day_type column to a factor with the levels ordered as
“weekday” and “weekend” to ensure consistent plotting and analysis in
later steps.
imputed_data$day_type <- ifelse(weekdays(imputed_data$date) %in% c("Saturday","Sunday"),
"weekend", "weekday")
imputed_data$day_type <- factor(imputed_data$day_type, levels = c("weekday", "weekend"))
The next step is to calculate the average number of steps taken
during each time interval, distinguishing between weekdays and weekends.
I chose the base R function aggregate() to group the data
by both interval and day type, then compute the average steps for each
group. After that, we convert the interval numbers into a standard time
format by adding leading zeros and extracting the hour and minute
components. This transformation allows us to plot the data on a
continuous time axis, making it easier to compare activity patterns
between weekdays and weekends.
interval_daytype <- aggregate(steps ~ interval + day_type, data = imputed_data, FUN = mean)
interval_padded <- sprintf("%04d", interval_daytype$interval)
hours <- as.integer(substr(interval_padded, 1, 2))
minutes <- as.integer(substr(interval_padded, 3, 4))
interval_daytype$time_of_day <- as.POSIXct(
sprintf("1970-01-01 %02d:%02d:00", hours, minutes),
format = "%Y-%m-%d %H:%M:%S", tz = "GMT"
)
This code plots the average number of steps taken throughout the day,
comparing patterns between weekdays and weekends. It maps
time_of_day to the x-axis and steps to the y-axis, with
line colors distinguishing between “weekday” and “weekend” activities.
The plot is divided into separate panels for weekdays and weekends using
facet_wrap, facilitating an easy comparison of activity
trends. The x-axis is formatted to display military time in 4-hour
intervals for clarity. Custom colors (#FF5733 for weekdays and #00BFFF
for weekends) enhance visual distinction, and the
theme_minimal provides a clean, professional look.
Additional theme adjustments ensure that titles and labels are bold and
colored appropriately, while removing the legend keeps the focus on the
faceted comparisons.
ggplot(interval_daytype, aes(x = time_of_day, y = steps, color = day_type)) +
geom_line(size = 1.2) +
facet_wrap(~ day_type, ncol = 1) +
scale_x_datetime(date_breaks = "4 hours", date_labels = "%H:%M", expand = c(0,0)) +
scale_color_manual(values = c("weekday" = "#FF5733",
"weekend" = "#00BFFF")) +
labs(title = "Avg. Steps by Time of Day for Weekdays vs. Weekends",
x = "Time of Day",
y = "Average Number of Steps") +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 16, color = "#0099f9"),
strip.text = element_text(face = "bold", size = 14, color = "black"),
axis.title = element_text(color = "black", face = "bold"),
axis.text = element_text(color = "black"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "gray80"),
legend.position = "none"
)
When comparing the average number of steps taken throughout the day, clear differences emerge between weekdays and weekends. On weekdays, there’s a noticeable increase in activity during the morning and evening hours, likely reflecting the routines of getting ready for work or school and winding down afterward. In the middle of the day, activity levels tend to drop, possibly due to periods of sedentary work or study. Conversely, weekends exhibit a more balanced and steady level of activity throughout the day, without the sharp peaks seen on weekdays. This suggests that weekends offer a more relaxed schedule, allowing for consistent movement and a variety of activities without the structured demands of weekday routines.
In conclusion, this analysis demonstrated the importance of addressing missing data to ensure accurate results. By examining daily and interval-level activity, we uncovered distinct patterns, such as higher activity peaks during weekdays compared to more evenly distributed activity on weekends. These findings offer insights into daily routines and the role of structured schedules in physical activity.
This R Markdown document follows reproducible research principles by providing clear and well-documented code for each analytical step, including data loading, preprocessing, visualization, and interpretation. The organized sections are designed to support replication and verification of the results.
- Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
- Hadley Wickham, Romain François, Lionel Henry, and Kirill Müller (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.2.
- Carson Sievert and Hadley Wickham (2023). plotly: Create Interactive Web Graphics via ‘plotly.js’. R package version 4.10.0.
- Jeroen Ooms (2022). hrbrthemes: Extra Themes, Theme Components and Utilities for ‘ggplot2’. R package version 0.8.0.
- Xie, Y. (2015). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.22.